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Elucidating the pathophysiology and molecular attributes of common disorders as well as developing targeted and ef- 
fective treatments hinges on the study of the relevant cell type and tissues. Pancreatic beta cells within the islets of 
Langerhans are centrally involved in the pathogenesis of both type 1 and type 2 diabetes. Describing the differentiated 
state of the human beta cell has been hampered so far by technical [low resolution microarrays] and biological limitations 
(whole islet preparations rather than isolated beta cells). We circumvent these by deep RNA sequencing of purified beta 
cells from 11 individuals, presenting here the first characterization of the human beta cell transcriptome. We perform the 
first comparison of gene expression profiles between beta cells, whole islets, and beta cell depleted islet preparations, 
revealing thus beta-cell-specific expression and splicing signatures. Further, we demonstrate that genes with consistent 
increased expression in beta cells have neuronal-like properties, a signal previously hypothesized. Finally, we find evidence 
for extensive allelic imbalance in expression and uncover genetic regulatory variants (eQTLs) active in beta cells. This first 
molecular blueprint of the human beta cell offers biological insight into its differentiated function, including expression of 
key genes associated with both major types of diabetes. 



[Supplemental material is available for this article.] 

Phenotypic differences among cell types, individuals, and pop- 
ulations (Stranger et al. 2007; Dimas et al. 2009; Nica et al. 2011) 
are determined by variation in gene expression. A substantial 
proportion of this variability is driven by DNA polymorphisms 
residing in regulatory elements proximal or distal to the affected 
genes (Price et al. 2011; Grundberg et al. 2012). Numerous such 
variants have now been mapped for a variety of tissues, high- 
lighting their tissue dependent properties and hence the acute 
need for expression profiling of a diverse panel of cell types (Nica 
et al. 2011; Grundberg et al. 2012). This became even more evident 
in the context of functionally elusive results from genome-wide 
association studies (GWAS), as transcript abundance has been 
shown to provide a direct and causal link between genotype and 
disease susceptibility (Emilsson et al. 2008; Nica et al. 2010). This 
connection has been mostly attainable in disease-relevant tissues, 
often in concordance with our present knowledge about the eti- 
ology of complex diseases (Nica et al. 201 1; Grundberg et al. 2012). 
With the substantial improvement in the accuracy and resolution 
of transcriptome profiling by direct RNA sequencing (RNA-seq) 
(Montgomery et al. 2010; Pickrell et al. 2010), it is now possible to 
explore these relations comprehensively in an unbiased manner, 
with no theoretical limitation for dynamic range of expression 
detection provided there is sufficient sequencing depth. 
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Insulin-secreting pancreatic beta cells within the islets of 
Langerhans have been consistently involved in the pathogenesis 
of diabetes via autoimmune mediated apoptosis (type 1 diabetes; 
TID) (Tisch and McDevitt 1996) or insulin deficiency (type 2 di- 
abetes; T2D) (Saltiel and Kahn 2001). The genetic landscape of 
both common forms of the disease has been substantially broad- 
ened, with now over 60 known loci robustly associated with either 
type 1 (Barrett et al. 2009) or type 2 diabetes (Morris et al. 2012). As 
already attested (Gaulton et al. 2010), regulatory changes will 
likely explain a proportion of these associations, but uncovering 
them is entirely dependent on first describing the transcriptional 
profile of the beta cell and understanding its genetic determinants. 
In this context, we interrogate here the human beta cell tran- 
scriptome in multiple whole-genome sequenced individuals and 
uncover beta-cell-specific features in the context of other pan- 
creatic endocrine cell types. 

Results 

Following ethical guidelines at the University Hospital in Geneva, 
we obtained human islets from 11 cadaveric pancreata from in- 
dividuals without documented diabetes (description in Supple- 
mental Table 1). The islet preparations were of high purity (mean ± 
SD: 84.6 ± 10.3%) as measured by dithizone staining, indicat- 
ing little contamination with exocrine tissue. The islet cells were 
sorted by fluorescence-activated cell sorting (FAGS) as previously 
documented (Parnaud et al. 2008) in order to obtain a highly pu- 
rified population of fully functional beta cells for each individual: 
86.7 ± 6.8% beta cell purity assessed by immunofluorescence 
staining for insulin (INS). We identified any contaminating alpha 
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and delta cells by costaining for glucagon (GCG) (4.4 ± 3.4%) and 
somatostatin (SST) (2.7 ± 0.8%). For five of the samples, we also 
generated beta cell depleted "nonbeta" populations consisting 
primarily of glucagon secreting alpha cells (59.8 ± 14.1%), with 
4.3 ± 2.6% beta cells. RNA was extracted immediately after FACS 
sorting from 23 cell preparations (11 beta, seven islet, and five 
nonbeta), cDNA libraries constructed (Thomas and Ansel 2010), 
and sequenced at very high depth. 

We obtained a median of 209 million total reads (paired end, 
49 bases) per sample (Supplemental Fig. la), attaining thus an 
exceptional transcriptome resolution. The reads were mapped to 
the latest version of the human genome assembly (GRCh37/hgl9) 
using BWA (Li and Durbin 2009) and subsequently filtered for 
mapping quality and correct pairing. On average, 75% of the fil- 
tered reads (median 121 million reads per sample) mapped to known 
exons annotated by GENCODE (version 10) (Harrow et al. 2006). We 
quantified all genes expressed in >90% of individuals in either of the 
three cell type preparations (N= 19,975), normalizing the read counts 
to exonic gene length and sequencing depth (reads per kilobase per 
million [RPKM] mapped reads) (Mortazavi et al. 2008). 

Principal component analysis (PC A) on RPKM units (Fig. lA) 
indicates that beta cells and islets cluster closely together and 
markedly separate from nonbeta cells, with INS, insulin-like 
growth factor 2 {IGF2), GCG, transthyretin (TTR), regenerating 
islet-derived 1 alpha (REGIA), and SST driving most of this 
separation (Supplemental Fig. 2). We notice a clear clustering of 
the islet-derived RNA-seq data in the context of 18 other tissues 
(obtained from Ambion's FirstChoice Human Total RNA Survey 
Panel and processed alike), with liver bearing most similarity to 
the islet samples. Unsurprisingly, islet purity influences gene 
expression (lowest purity preparation P786i clusters less well), 
yet this effect is not very large in our samples of overall high 
quality. To illustrate this, we quantified the proportion of true 
positives estimated from the enrichment of significant P-values 
(pil) (Storey and Tibshirani 2003) resulting after correlating each 
gene's RPKM with purity (pil = 0-0.2) (Supplemental Fig. 3). 

Overall, we observe a high ranked correlation between beta 
cell and islet-expressed genes (rho = 0.94) (Fig. IB), and we estimate 
that 87% of the variance in beta cell gene expression can be 
explained by using islet expression as proxy. Given the estimated 
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Figure 1 . High overall similarity between beta cell and islet gene expression. (A) PCA analysis of gene 
RPKMs for beta (N = 1 1 ), islet (N = 7), nonbeta (N = 5) preparations, and 1 8 other tissues from unrelated 
individuals. Beta cell and islet samples cluster together, separating from nonbetas. The other tissues 
cluster separately, with liver being the most similar to the islet-derived RNA-seq data. (B) Scatterplot of 
beta cell versus islet median RPKMs on logio scale. 



~ 75% beta cell composition of the human islet (Pisania et al. 2010) 
and the quality of the material used for this study, the similarity 
between these two cell types is not surprising. As expected, INS was 
by far the most abundantly transcribed gene, followed by INS-IGF2 
and IGF2, making up -38%, 10%, and 2% of the total nuclear beta 
cell transcriptome, respectively (Fig. 2A). The corresponding rela- 
tive percentages are lower in the islet, by approximately twofold 
{INS 21%, INS-IGF2 5.8%, IGF2 1.1%). 

To uncover transcriptional signatures specific to beta cells, we 
next tested for differential gene expression after appropriate sta- 
tistical modeling of the raw counts (DESeq) (Anders and Huber 
2010). When comparing the beta cell sequence count data to 18 
non-islet tissues, we discovered 2980 differentially expressed (DE) 
genes (—12% of genes tested) at 10% false discovery rate (FDR), 417 
of which were significantly overexpressed in beta cells. These genes 
underlie the general features of endocrine pancreatic activity 
(DAVID) (Huang et al. 2009; Supplemental Table 2, enrichment 
results). We next sought to identify beta-cell-specific genes only in 
the context of the islet and its other hormone-producing cell types, 
while controlling for islet purity (included as covariate). Of the 
19,975 genes tested, we found 5555 DE genes between the beta cell 
and islet preparations and 4380 DE genes when comparing beta 
and nonbeta cells (Table lA). The difference in power is due to 
the smaller sample size of the nonbeta population and its greater 
variance in composition of cell types. We find in both cases a larger 
proportion (approximately two-thirds) of overexpressed genes 
in the more heterogeneous of the cell populations compared 
(log2 Fold Change > 0) (Fig. 2B,C). The remaining one-third of the 
DE genes are significantly overexpressed in beta cells, with smaller 
fold changes in islets as expected (log2 Fold Change mean ± SD: 
-1.06 ± 0.37 islet/beta, -1.64 ± 0.87 nonbeta/beta). Notably we 
observe a significant enrichment of annotated lincRNAs in beta 
cell overexpressed genes, corroborating their cell-type-specific 
expression properties reported previously in other tissues (Cabili 
et al. 2011): We detect 132 overexpressed lincRNAs in beta versus 
islet (1.59-fold enrichment over protein coding genes. Fisher's 
P-value: 1.1 X 10~^) and 148 overexpressed lincRNAs in beta versus 
nonbeta (1.9-fold enrichment. Fisher's P-value: 1.02 X 10~^). These 
are likely underestimates as we limited ourselves to noncoding RNAs 
currently present in the annotation (Moran et al. 2012). 

To portray the unique biological 
characteristics of beta cells in the islet 
context, we defined a stringent "beta cell 
specific" set of genes (N = 526) as the in- 
tersection of the beta cell overexpressed 
genes from the two pairwise comparisons 
(1987 beta-islet and 1583 beta-nonbeta) 
ranking in the following order of expres- 
sion: beta > islet > nonbeta. Similarly, we 
defined a set of "nonbeta cell specific" 
genes {N = 614) consistently and sig- 
nificantly underexpressed in beta cells 
(nonbeta > islet > beta). A functional an- 
notation analysis reveals that beta-cell- 
specific genes have neuron-like proper- 
ties (Table 2), a similarity noted earlier in 
studies implicating glucose-sensing hy- 
pothalamic neurons in nutrient homeo- 
stasis (Schwartz et al. 2000) or pancreatic 
hormone release (Ahren 2000). Nonbeta- 
cell-specific genes are mostly enriched 
for cell surface components (N-glycopro- 
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Figure 2. Expression differences between beta, islet, and nonbeta samples. (A) Dotchartof top 1 0 highest expressed genesand their contribution to the 
nuclear transcriptome by cell type. (B) Histogram of log2 Fold Change (islet/beta) for differentially expressed genes (10% FDR) in islets and beta cells. 
(C) Histogram of log2 Fold Change (nonbeta/beta) for differentially expressed genes (10% FDR) in nonbeta and beta cells. 



teins) central to both intercellular and cell-environment commu- 
nication (Danzer et al. 2012), secretory proteins (signal peptide), 
and extracellular matrix components. 

The two gene sets ("beta" and "nonbeta'O offer a more accu- 
rate depiction of the transcripts that define the molecular identity 
of the major islet cell types: Table 3 shows the top 30 beta and 
nonbeta cell enriched transcripts, respectively, in descending order 



Table 1. Differentially expressed (DE) genes and exon links 
between beta cells, islets, and nonbetas 



(A) Differentially expressed genes (10% FDR) — 19,975 genes tested 


Beta 


Beta 




overexpressed 


underexpressed 


Total 


Islet 1987 


3568 


5555 


Nonbeta 1583 


2797 


4380 


(B) Differentially expressed exon links (10% FDR), corresponding 
gene counts, and subset of candidate alternatively spliced genes 
(genes not DE but having DE links) — 12,334 genes tested 


Beta 


Beta 




overexpressed 


underexpressed 


Total 


Islet 






Links 11,843 


16,340 


28,183 


Gene count 21 30 


2964 


5072 


Gene count excluding DE genes 
Nonbeta 




2167 


Links 5688 


10,238 


15,926 


Gene count 1199 


1844 


3025 


Gene count excluding DE genes 




998 



Pairwise DESeq results (beta versus islet, beta versus nonbeta) of signifi- 
cant changes in magnitude of expression of genes (A), exon links (B) 
(reads spanning exon junctions). 



of RPKM and fold change (RPKM > 1 cutoff was used). This ap- 
proach successfully filters out highly expressed genes in contami- 
nating cell types (e.g., SST, GCG from somatostatin, and glucagon 
cells contaminating the beta cell population), otherwise mistaken 
as key players in the expression signature of beta cells. In addition 
to known beta-cell-specific transcripts {INS, IGF2, PDXl) we 
highlight further targets, some featured already in a microarray 
analysis of sorted islet cells (Dorrell et al. 2011b), e.g., RGS16, 
negative regulator of G-protein signaling, involved in endocrine 
pancreas development and re-expressed in adult cells in response 
to GLP-1 (Villasenor et al. 2010); ADCYAPl, pituitary adenylate 
cyclase activating polypeptide 1, involved in insulin secretion and 
beta cell regeneration/proliferation (Sakurai et al. 2011); HADH, 
hydroxyacyl-CoA dehydrogenase, negative regulator of insulin 
secretion (Hardy et al. 2007) associated with Alzheimer's (Nicolls 
et al. 2003), which is in turn associated with diabetes. Many other 
genes however have not been described before in the context of 
beta cells, including: NPTX2, neuronal pentraxin 2, found in 
neuronal cells and gliomas but also shown to be frequently down- 
regulated in pancreatic cancers (Zhang et al. 2012); TSPANl, tet- 
raspanin 1, which can associate with alpha6.betal integrin and 
promote FAK phosphorylation (Huang et al. 2008) shown by us to 
be involved in insulin secretion (Rondas et al. 2011); GPM6A, 
neuronal membrane glycoprotein of unknown function but 
identified as a beta cell marker in sorted mouse islet cells (Dorrell 
et al. 2011a); BMP5, bone morphogenic protein 5, implicated in 
pancreas and fetal beta cell development (Jiang et al. 2002); and 
P2RY1, purinergic receptor through which ADP and ATP modulate 
insulin secretion (Fernandez-Alvarez et al. 2001). 

Alternative splicing (AS), a common feature of most (—94%) 
eukaryotic genes contributing to tissue specificity (Pan et al. 2008) 
is also significantly enriched in beta-cell-specific genes {N = 200 
genes, 1.22-fold enrichment). To assess the extent of AS between 
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Table 2. Functional characteristics of beta-cell-specific and -nonspecific genes in islet context 

Beta-cell-specific genes in islet context {N = 526) Beta-cell-nonspecific genes in islet context (N = 614) 



Benjamini Fold Benjamini Fold 

Term P-value Count enrichment Term P-value Count enrichment 
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Top 20 most significantly overrepresented functional annotation terms (DAVID). 



islet cell types, we quantified all the exon links (reads spanning 
pairs of exons only) (H Ongen and ET Dermitzakis, unpubl.) in 
the beta, islet and nonbeta preparations and tested for differ- 
ential expression, those exon pairs making a minimum of five 
links in at least 90% of the samples (N = 154,190). We find 
28,183 DE links (10% FDR) between beta and islets and 15,926 
DE links between beta and nonbeta (Table IB), corresponding to 
5072 and 3025 genes, respectively. In 2167 and 998 of these 
genes, respectively, the DE links observed between the different 
islet cell types were not due to DE of the underlying genes. This 
implicates 8%-18% of the 12,334 genes thus tested as candi- 
dates for islet AS. 

Mapping RNA-seq data to the genome fails to identify reads 
that span exon junctions. To quantify this effect in our data and 
discover potential unannotated transcripts, we constructed all 
possible 96-mer exon junctions for each gene (read length-1 = 48 
bases on either side of the junction) and mapped all reads against 
them. On average, 2.9 million reads per beta cell sample did not 
map to the genome but were recovered by junction mapping 
(Supplemental Table 3), a small percentage (—1.36%) of the total 
number of reads. Nevertheless, we used these together with reads 
mapping better to exon junctions than the genome (better map- 
ping quality and larger alignment length) to assess the percentage 
of previously unobserved junctions. We identified 46,096 junc- 
tions in 7717 genes covered by at least 10 good quality reads (map 
quality greater than 10) in all beta cell samples, 657 of which (1.4% 
of all junctions) were not present in the transcriptome annotation 
(GENCODE vlO). These correspond to 465 genes enriched for 
posttranslational functions (acetylation: 8.9 X 10~^^; phospho- 
protein: 1.3 X 10~^^; ribosome: 1.6 X 10~^^) and possibly con- 
taining new beta-cell-specific transcripts. 

We next aimed to study the genetic control of the variation in 
beta cell and islet gene expression by integrating genome sequence 



level information when available. We extracted DNA from seven 
of the donors and sequenced them to medium (16x) coverage 
(Supplemental Fig. lb). The reads were aligned to the hgl9 refer- 
ence genome with BWA, we applied GATK (McKenna et al. 2010) 
base quality score recalibration, indel realignment and duplicate 
removal, followed by single nucleotide polymorphism (SNP) dis- 
covery and genotyping across all seven samples, simultaneously 
using variant quality score recalibration (DePristo et al. 2011). 
Subsequently, we imputed the variants on the 1000 Genomes 
reference panel and phased them with BEAGLE 3.3.2 (Browning 
and Browning 2007), resulting in 5,748,462 good confidence au- 
tosomal SNPs for analysis. For each individual, we tested the subset 
of heterozygous SNPs with good coverage in the RNA-seq data 
(beta, islet, and nonbeta) for allele-specific expression (ASE) (me- 
dian 23,358 sites per sample). On average, we observe that 9.3% of 
the heterozygous sites tested show significant ASE at 10% FDR 
(median 2626 sites per sample) (Supplemental Fig. 4). These cor- 
respond to a median of 1742 genes of 6756 genes tested per in- 
dividual, harboring at least one ASE SNP (equivalent to 59.18% of 
the total number of tested genes in all samples). A subset of these 
genes are of particular functional interest, having been linked to 
diabetes in GWAS: 15 of 23 TID genes tested show ASE, 20 of 28 
T2D genes and 15 of 18 genes associated with fasting glucose or 
insulin levels. Except for the subset of genes associated to fasting 
glucose or insulin (Fisher's P-value 0.02), this enrichment was not 
statistically significant. A substantial number of diabetes suscepti- 
bility genes are however expressed in beta cells, corroborating the 
GWAS predictions (32/40 TID, 37/39 T2D, 19/20 glucose or insulin 
levels) (Supplemental Fig. 5). Several of these {N= 9) are in fact beta- 
cell-specific transcripts: TlD-associated/N5, imprinted M£G3-DLiCi, 
GLIS3; T2D-associated VEGFA, SLC30A8; fasting glucose/insulin 
levels-associated yiDiM2yi, G6PC2, SLC2A2. Interestingly, two other 
genes, SH2B3 (TlD-linked) and IRSl (T2D-linked) are significantly 
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Table 3. Top 30 genes enriched in the beta cell population and nonbeta cell population ordered by RPKM and fold change, respectively 
Top 30 beta cell enriched genes Top 30 nonbeta cell enriched genes 



Ordered by fold change Ordered by fold change 
Ordered by RPKM Ordered by RPKM 



Gene 


Beta RPKM 


Gene 


Fold change 
Beta/ nonbeta 


Gene 


Nonbeta RPKM 


Gene 


Fold change 
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1 2,41 8.51 
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VEGFA 
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P2RY1 


9.99 


TMEM176B 


81.95 


TM4SF4 


9.80 
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APOH 


73.74 


SEPT9 


9.21 


GAD2 


96.59 


VATIL 


8.53 


LSR 


72.98 


NPNT 


9.01 


ALCAM 


95.79 


C8orf47 


8.38 


C19orf77 


67.72 


FOSL2 


8.61 


NKX6-1 


91.78 


PDXl 


8.27 


FOSL2 


64.69 


MAMLDl 


8.31 


GJD2 


87.30 


SLCl 7A6 


8.12 


LMNA 


58.09 


SERPINAl 


7.93 


TSC22D1 


86.05 


03FAR1 


8.07 


IRX2 


56.92 


SMOCl 


7.69 


CASR 


79.54 


SCD5 


8.05 


PAPPA2 


51.76 


FXYD5 


7.59 


SLC6A6 


77.41 


OLFM1 


8.05 


ARX 


50.97 


NEDD9 


7.49 


RP11-713B9.1 


70.46 


SLC35D3 


7.95 


\A/VA1 


50.20 


KBTBD10 


6.96 


BTG3 


64.76 


FAM115C 


7.90 


SMOC1 


40.44 


AHNAK 


6.60 


CABP7 


59.66 


TMEM150C 


7.89 


RAP1GAP 


40.29 


LBH 


6.36 


SORL1 


56.36 


CASR 


7.78 


KBTBD10 


39.63 


DUSP4 


6.20 



Table contains genes with RPKM > 1 consistently and significantly overexpressed in beta cells (order of expression: beta > islet > nonbeta) and nonbeta 
cells, respectively (order of expression: nonbeta > islet > beta), ordered in descending order of RPKM and fold change. 



overexpressed in the nonbeta population — given our data, it is 
tempting to speculate that alpha cell misregulation would be det- 
rimental in these cases. In some instances, we find evidence of 
significant allelic imbalance for the exact GWAS SNPs reported, 
e.g., rs5215 (KCNJll), a missense SNP associated to T2D risk is also 
an ASE variant in four beta cell samples, rs689 (INS), associated to 
TID autoantibodies is an ASE SNP in four samples, and rsl 1558471 
{SLC30A8) associated with fasting glucose-related traits and pro- 
insulin levels is an ASE SNP in 1 sample (the remaining samples 
were homozygous for the respective SNPs, hence could not be 
tested for ASE). These observations suggest that a subset of di- 
abetes-associated loci could contribute to disease progression via 
a change in beta cell gene expression. 

As expected for true positive ASE sites, the direction of effect is 
almost always consistent for the same individual between different 
tissues (Fig. 3). The high overall concordance between beta cell and 
islet expression data noted before is recapitulated at the sequence 
level. Taking difference of power into account (using pil to mea- 
sure sharing of statistical significance), we estimate a median 
89. 1 1% enrichment of significant beta cell ASE P- values in the islet 
and a 88.11% median enrichment vice versa (significant islet ASE 
in the beta cell). One of the explanations for the ASE events dis- 
covered could be the presence of nearby regulatory variants (ex- 
pression quantitative trait loci [eQTLs]). The current sample size is 
prohibitively small for eQTL discovery; however, in an effort to 
find those potentially active in beta cells, we integrated ASE sites 



with eQTLs discovered in the best-powered multitissue data set to 
date (MuTHER) (Grundberg et al. 2012), which reported 3148 
eQTLs in fat, 3953 eQTLs in lymphoblastoid cell lines (LCLs), and 
2515 eQTLs in skin. For each gene and tissue, we phased the top 
eQTL SNP and top ASE SNP per individual, being thus able to test in 
our data 1806 of the eQTLs discovered in fat, 2272 eQTLs discov- 
ered in LCLs, and 1400 eQTLs discovered in skin. We observe 
a significantly greater deviation from the expected reference/total 
allelic ratio (0.5) in individuals heterozygous for the eQTLs com- 
pared to homozygotes (Mann-Whitney P-value < 6.18 X 10~^^), 
indicating that a proportion of these regulatory variants are also 
active in beta cells (Fig. 4). To shortlist them, we selected cases in 
which the direction of the eQTL effect was in agreement with the 
significant (ASE P-value < 0.01) allelic imbalance prediction. We 
thus report 254 candidate beta eQTL genes using the data from fat 
cells, 294 genes from LCLs, and 198 from skin. Reassuringly, 303 of 
the 510 total genes detected as such (59.4%) are shared in at least 
two of the MuTHER tissues, a significant enrichment (Fisher's 
P-value: 6.2 X 10"^^) over the 44.8% eQTL sharing we began with. 
This suggests that these eQTLs are not highly tissue specific and 
therefore likely to be also present in beta cells. 

Discussion 

Pancreatic islets are essential for regulating and maintaining glu- 
cose homeostasis. This functional role is fulfilled by at least five 
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Figure 3. Sharing of significant ASE effects between beta cells and islets (sample P775). Left panel histograms show the enrichment (pil ) of significant 
ASE P-values in the isletsfor ASE sites discovered in beta cells and vice versa (beta cell ASE P-values of ASE sites discovered in islets). Right pane\ scatterplots 
display the direction of ASE effects (ratio of reference allele count to the total number of reads covering that site) between the two cell types, almost always 
in concordance. 



distinct hormone-producing cell types, which act differently but 
synergistically to help maintain appropriate glycemic levels in 
healthy individuals. Therefore, uncovering the molecular identity 
of ideally each of these cell types would ensure a better under- 
standing of the mechanisms through which they become de- 
fective in diabetic patients. We present here the first ever unbiased 
transcriptome analysis of the differentiated human beta cell, the 
most abundant pancreatic islet cell type that secretes insulin and is 
severely affected in common forms of diabetes. The advanced cell 
sorting protocol, immediate RNA extraction, and unprecedented 
sequencing resolution makes this unique data set the most faithful 
representation of the human beta cell transcriptome to date. Im- 
portantly, we compare this to the expression profile of whole islets 
and beta cell depleted islet preparations ("nonbeta"), uncovering 
unique beta-cell-specific features in contrast to other pancreatic 
endocrine cell types. In the absence of the three-way beta-islet- 
nonbeta transcriptome comparison enabled by our study design, 
such discoveries would remain concealed. 

Overall, we observe that islets are a good proxy for beta cell 
gene expression, provided they are of high enough purity (>85%). 
However, important biological insights into cell-type-specific ex- 
pression signatures are revealed when comparing detailed RNA 



profiles of purified beta cells with those of whole islets or "non- 
beta" cells. We find substantial evidence for differential expression 
between beta cells and islets, with more than 5000 genes showing 
significant changes in transcript abundance between the two. 
Genes consistently overexpressed in beta cells are enriched 
for neuron-like properties, a similarity suspected earlier that we 
now independently confirm. These important biological insights 
would not be revealed by islet expression studies alone. Further, we 
corroborate current GWAS results by reporting beta cell expression 
of a large number of genes (80%-95%) previously associated with 
type 1 or type 2 diabetes. Differential splicing is also present among 
islet cell types, with up to 18% of genes undergoing this process in 
our data. Altogether, this draws attention to the general limitations 
of analyzing a mixture of cell populations as opposed to prepara- 
tions highly enriched for a single cell type of interest. 

Finally, we describe here the first set of genetic variants 
controlling interindividual variability in beta cell gene expres- 
sion. Despite the small sample size and consequent limited 
number of informative genetic variants, we are able to report al- 
lelic and genotypic expression differences, offering a first glance 
at the genetics behind human beta cell function. Our evidence for 
diabetes-associated loci expressed in an allele-specific manner 
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Figure 4. Candidate beta cell c/s eQTLs discovered initially in other tissues (fat, LCL, skin). Top panel 
boxplots show the deviations of allelic ratios (reference/total) from the expected 0.5 in ASE individuals 
grouped by eQTL genotype, with heterozygotes having markedly higher effects on ASE ratios compared 
to homozygotes. Bottom scatterplots show the beta coefficients of the MuTHER eQTLs and the corre- 
sponding beta ASE ratios for the selected candidate beta cell regulatory variants. 



flags potential contributions to disease effects by genetically or 
epigenetically driven expression changes in beta cells. These are 
encouraging results, which remain to be refined and augmented 
by a much-awaited beta cell eQTL study. 

Taken together, the data will serve the diabetes research 
community manifold ways, including toward the validation of 
future stem-cell derived or other surrogate beta cells for research 
purposes or regenerative medicine. 

Methods 

Sample collection 

Islets isolated from cadaveric pancreas were provided by the Cell 
Isolation and Transplant Center, Department of Surgery, Geneva 
University Hospital (Drs. T. Berney and D. Bosco) through the Ju- 
venile Diabetes Research Foundation (JDRF) award 31-2008-416 
(ECIT Islet for Basic Research Program). 

Beta cell sorting 

Islets were dispersed into single cells, stained with Newport Green, 
and sorted into "beta" and "nonbeta" populations as described pre- 
viously (Parnaud et al. 2008). The proportion of beta (insulin), alpha 
(glucagon), and delta (somatostatin) cells in each population (as 
percentage of total cells) was determined by immunofluorescence. 

RNA extraction 

Sorted beta cells, nonbeta cells, and islets were centrifuged; the su- 
pernatant was removed; and the pellet disrupted in RLT buffer 
(RNeasy, Qiagen). Total RNA was prepared according to the standard 
RNeasy protocol. 



skin DNA extraction 

DNA was prepared according to the stan- 
dard DNeasy (Qiagen) protocol. 



Library preparation and sequencing 

Total RNA and genomic DNA libraries 
were constructed following customary 
lUumina TruSeq protocols for next- 
generation sequencing. PolyA-selected 
mRNAs were purified, size-fractioned, 
and subsequently converted to single- 
stranded cDNA by random hexamer 
priming. Following second-strand syn- 
thesis, double-stranded cDNAs were 
blunt-end fragmented and indexed us- 
ing adapter ligation, after which they 
were amplified and sequenced accord- 
ing to protocol. RNA libraries were 
49-bp paired-end sequenced with one 
or a maximum of two samples per 
HiSeq 2000 lane. DNA libraries for 
whole-genome sequencing were con- 
structed similarly, but starting directly 
from fragmented genomic DNA. The 
seven DNA libraries were pooled and 
sequenced in a total of nine HiSeq lanes 
each (one control lane of 49-bp paired- 
end reads and eight lanes of 95-bp paired- 
end sequencing). Standard quality checks 
for material degradation (Bioanalyzer, 
Agilent Technologies) and concentration 
(Qubit, Life Technologies) were done before and after library con- 
struction, ensuring that samples are suitable for sequencing. 

Read mapping 

Paired-end reads were mapped to the human genome (assembly 
version GRCh37/hgl9) using BWA and allowing a maximum in- 
sert size of 500 kb (for instances when not enough good align- 
ments could be used to infer the insert size distribution). RNA-seq 
reads were subsequently filtered for correct orientation of the 
mapped mate pairs with an insert size <500 kb and a minimum 
mapping quality score of 10. 

Expression quantification 

GENCODE annotation vlO was used to assign filtered reads to their 
corresponding exons and genes. We filtered the annotation to in- 
clude transcribed biotypes (coding and noncoding), i.e., all protein 
coding genes, lincRNAs, processed transcripts, noncoding, poly- 
morphic pseudogenes, transcribed processed pseudogenes, and 
transcribed unprocessed pseudogenes. For each gene, we processed 
the exons from all transcripts, which we quantified by taking into 
account only filtered reads as above, in which both mates of a pair 
map to exons of the same gene. The gene counts were incremented 
nonredundantly, i.e., reads overlapping two exons are counted 
once to the total count sum per gene. Resulting raw gene counts 
were normalized to gene length (sum of exons) and sequencing 
depth, i.e., reads per kilobase per million (RPKM) mapped reads. 

Differential expression analysis 

Raw gene/exon link counts were modeled using a negative bi- 
nomial distribution and tested for differential gene expression 
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with the DESeq R package. Following author's guidelines, we 
estimated the size factor for each sample from the count data, 
followed by dispersion estimations for each gene/link. Sub- 
sequently, we tested for differential expression between condi- 
tions. We grouped all beta cell, islet, nonbeta, and the 18 dif- 
ferent tissues (adipose, bladder, brain, cervix, colon, esophagus, 
heart, kidney, liver, lung, ovary, placenta, prostate, spleen, testes, 
thymus, thyroid, trachea) in separate conditions (beta, islet, 
nonbeta, other) and tested for differential expression: beta versus 
other, beta versus islet, and beta versus nonbeta. For the pairwise 
DESeq tests of islet-derived data only, we included islet purity as 
a covariate: We fitted two generalized linear models and com- 
pared them to see whether the purity factor improved the fit (i.e., 
had a significant effect). The P-values of the comparisons were 
then adjusted for multiple testing using the Benjamini-Hochberg 
method. Adjusted P-values < 0.1 were considered significant. In- 
troducing the islet purity as a covariate improved the fit in all 
cases. 

Quantification of exon links 

We made use of the paired-end nature of our RNA-seq experi- 
ment to relatively quantify splicing events. Specifically, we used 
filtered mate pairs (correctly paired and 10 minimum mapping 
quality) where one mate maps to one exon and the other mate to 
a different exon to count "links" between two exons. The first 
exon in a link was referred to as the "primary exon." Over- 
lapping exons were grouped into "exon groups" and unique 
portions of each exon in an exon group were identified. These 
unique portions were subsequently used to assign reads to an 
exon. 

Generation of exon junctions 

We have constructed all possible 96-mer junctions (flanking 
sequence on either side = read length - 1) per gene using 
the transcriptome information in Ensembl v65 (Flicek et al. 
2013). For each gene, we first kept track of all the unique start 
and end positions of all its exons. Subsequently, for non- 
redundant exon-exon pairs with valid positions (i.e., starting 
position of second exon greater than ending position of first 
exon in junction), we constructed every possible pairwise 
combination within the length limits of the exons involved. 
For junctions failing this criterion due to short exons on one or 
both ends, the sequence was expanded into flanking exons. For 
every valid junction, we recorded whether it forms part of any 
known transcript or not. 

Functional annotation analysis 

Gene functional analyses were performed using The Database for 
Annotation, Visualization and Integrated Discovery (DAVID) v6.7. 
We have used the Functional Annotation Chart tool to analyze our 
genes of interest (e.g., beta cell specific) in the context of the whole 
human genome. DAVID 's default annotation categories were used 
and enrichment results sorted by significance of Benjamini cor- 
rected P-values. 



SNP calling 

Variant calling was performed with the Genome Analysis Toolkit 
(GATK) 1.5.31 following the Best Practice Variant Detection v3. 
Reads were aligned to the hgl9 reference genome with BWA, and 
bam files from each lane were merged into one bam per sample. For 
each sample we removed duplicates, realigned the reads around 



known indels from the 1000 Genomes, applied GATK base quality 
score recalibration, followed then by SNP discovery and genotyp- 
ing across all seven samples simultaneously with variant quality 
score recalibration. We used a confidence score threshold of 30 for 
variant detection and a minimum base quality of 1 7 to consider 
a base for calling. Good confidence (1% FDR) SNP calls were then 
imputed on the 1000 Genomes reference panel and phased with 
BEAGLE 3.3.2. 

ASE analysis 

Allele-specific expression (ASE) analysis was performed per- 
individual on the subset of heterozygous sites with a minimum 
coverage of 16 reads per site filtered for mapping quality of 10. 
Problematic positions were filtered out as follows to exclude sites 
susceptible to allelic mapping bias: (1) sites with UCSC 50-bp 
mapability less than one, implying that the 50-bp flanking region 
of the site is nonunique in the genome and collapsed repeat re- 
gions were excluded; and (2) simulated RNA-seq reads overlapping 
the site and showing >5% difference in the mapping of reads that 
carry the reference or nonreference allele were also excluded. Next, 
we calculated the expected reference allele ratio for each in- 
dividual by summing up reads across all sites separately for each 
SNP allele combination after down-sampling reads of sites in the 
top 25th coverage percentile in order to avoid the highest cov- 
ered sites having a disproportionally large effect on the ratios. 
These expected REF/TOTAL ratios (in the range of 0.489-0.518) 
correct for the remaining slight genome-wide mapping bias in 
each individual. In a binomial test of the REF/NONREF allele 
counts, we used these individual-specific expected ratios to 
weight the occurrence of each allele and determine sites with 
>16 reads in each individual deviating from the biased-corrected 
expected allelic ratios. We called sites in significant ASE if 
resulting P-values < 0.01. 

eQTL integration 

We obtained cis eQTLs (1% FDR) detected in the MuTHER study 
where 856 female twins were profiled for gene expression with 
microarrays in three tissues: fat, lymphoblastoid cell lines (LCLs), 
and skin. These amounted to 3148 genes in fat, 3953 genes in 
LCLs, and 2515 genes in skin having cis eQTLs. We phased the top 
SNPs per gene in each of these data sets with top beta cell ASE sites 
for the same genes. We contrasted the beta cell allelic ratios for 
individuals heterozygous for the MuTHER eQTLs compared to 
homozygotes, expecting smaller effects in homozygotes, where 
the eQTL itself (if active also in beta cells) would have no effect on 
expression. 

Data access 

Genotype and sequence data have been deposited at the European 
Genome-phenome Archive (EGA; http://www.ebi.ac.uk/ega/), 
which is hosted by the European Bioinformatics Institute (EBI), 
under accession number EGAS00001000442. Anonymized se- 
quence files (matching the reference genome) are freely and pub- 
licly available on our ftp server (ftp://jungle.unige.ch/) as well as 
Ensembl. 
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